N-gram Overlap in Automatic Detection of Document Derivation

Author

  • Siniša Bosanac
Abstract

Establishing the authenticity and independence of a document in relation to others is not a new problem, but in the era of hyperproduction of e-text it has gained even more importance. There is an increased need for automatic methods of determining the originality of documents in a digital environment. N-gram overlap is one of several methods proposed in the literature and is used in a variety of systems for automatic identification of text reuse. Although the method itself is simple, determining the n-gram length that is a good indicator of text reuse is a more complex issue. We assume that the optimal n-gram length is not the same for all languages but depends on properties of the particular language, such as morphological typology and syntactic features. The aim of this study is to find the optimal n-gram length for determining document derivation in the Croatian language. Potential applications of the results include automatic detection of plagiarism in academic and student papers, citation analysis, information flow tracking, and event detection in online texts.
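To make the idea of n-gram overlap concrete, here is a minimal Python sketch. It is not the study's implementation: the abstract does not say which overlap measure, tokenization, or n-gram unit was used. The sketch assumes word-level n-grams, lowercase whitespace tokenization, and a containment score (the share of the candidate's n-grams that also occur in the source). The names `word_ngrams` and `containment` and the two sample sentences are hypothetical.

```python
from typing import List, Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams in text.

    Assumption: tokens are lowercased and split on whitespace only.
    """
    tokens: List[str] = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(candidate: str, source: str, n: int) -> float:
    """Share of the candidate's distinct n-grams that also occur in the source.

    Assumption: containment is only one possible overlap measure;
    the abstract does not name the one used in the study.
    """
    cand = word_ngrams(candidate, n)
    if not cand:
        # Candidate is shorter than n words, so it has no n-grams.
        return 0.0
    return len(cand & word_ngrams(source, n)) / len(cand)


# Hypothetical sample texts, not taken from the study's corpus.
source = "the quick brown fox jumps over the lazy dog"
candidate = "a quick brown fox jumps over a sleeping cat"

for n in (1, 2, 3):
    print(n, containment(candidate, source, n))
```

On this toy pair the score falls as n grows, which illustrates the trade-off the abstract describes: short n-grams match even unrelated text, while long n-grams miss reuse that has been lightly edited.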

Similar resources

Similarity Overlap Metric and Greedy String Tiling at PAN 2012: Plagiarism Detection

This paper reports the best-performing approach for the candidate document retrieval task and the approach used for the detailed comparison task of the plagiarism detection track at PAN 2012. The aim of the participation was to understand a few of the computer-assisted approaches used for plagiarism detection. Plagiarism detection depends on two broad tasks, (1) the candidate d...

A survey on Automatic Text Summarization

Text summarization endeavors to produce a summary version of a text while maintaining the original ideas. The textual content on the web, in particular, is growing at an exponential rate. The ability to sift through such a massive amount of data in order to extract the useful information is a major undertaking and requires an automatic mechanism to aid with the extant repository of informa...

Automatic Summarization from Multiple Documents

This work reports on research conducted on the domain of multi-document summarization using background knowledge. The research focuses on summary evaluation and the implementation of a set of generic use tools for NLP tasks and especially for automatic summarization. Within this work we formalize the n-gram graph representation and its use in NLP tasks. We present the use of n-gram graphs for t...

Detecting Derivatives using Specific and Invariant Descriptors

This paper explores the detection of derivation links between texts (otherwise called plagiarism, near-duplication, revision, etc.) at the document level. We evaluate the use of textual elements implementing the ideas of specificity and invariance as well as their combination to characterize derivatives. We built a French press corpus based on Wikinews revisions to run this evaluation. We obtai...

Evaluating Summaries and Answers: Two Sides of the Same Coin?

This paper discusses the convergence between question answering and multidocument summarization, pointing out implications and opportunities for knowledge transfer in both directions. As a case study in one direction, we discuss the recent development of an automatic method for evaluating definition questions based on n-gram overlap, a commonly used technique in summarization evaluation. In the ...

Publication date: 2011